[feat] Adding support for PostgreSQL storage via SQLAlchemy #3567
base: develop
Conversation
…e-only behaviour and schema layout.
- Introduce a _ConnectionStrategy class and a _SqliteConnector subclass to encapsulate URL building, engine creation, and per-dialect kwargs.
- Build the schema with SQLAlchemy Core (MetaData, Table, Column), preserving table/column names and the index_testcases_time index.
- Convert INSERT, SELECT and DELETE queries to Core expressions.
- Enable SQLite foreign keys via PRAGMA foreign_keys=ON on each connection (matches the previous cascade semantics).
- Replace the raw sqlite3 connection/locking with SQLAlchemy transactional contexts (engine.begin()).
- Drop the legacy file lock code and direct sqlite3.connect() calls.
- Remove the uuids_sql raw string building in the remove paths; use SQLAlchemy's in_() instead.
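For reference, a minimal sketch of what such a connector split could look like; the class names follow the commit description above, while the method names (`create_engine`, `_build_connection_url`) and bodies are illustrative assumptions, not the PR's actual code:

import sqlalchemy as sa
from sqlalchemy import event
from sqlalchemy.engine import URL


class _ConnectionStrategy:
    # Base class encapsulating per-dialect URL building and engine creation
    def _build_connection_url(self):
        raise NotImplementedError

    def create_engine(self):
        return sa.create_engine(self._build_connection_url())


class _SqliteConnector(_ConnectionStrategy):
    def __init__(self, dbfile):
        self.__dbfile = dbfile

    def _build_connection_url(self):
        return URL.create(drivername='sqlite',
                          database=self.__dbfile).render_as_string()

    def create_engine(self):
        engine = sa.create_engine(self._build_connection_url())

        # Enforce foreign keys on every new DBAPI connection, matching the
        # previous ON DELETE CASCADE semantics of the raw sqlite3 backend
        @event.listens_for(engine, 'connect')
        def _set_sqlite_pragma(dbapi_conn, conn_record):
            cursor = dbapi_conn.cursor()
            cursor.execute('PRAGMA foreign_keys=ON')
            cursor.close()

        return engine

A deletion by UUID can then be expressed as `sa.delete(sessions).where(sessions.c.uuid.in_(uuids))` and executed inside an `engine.begin()` block, instead of building the SQL string by hand.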
- Introduce _PostgresConnector and wire backend='postgresql' in StorageBackend.create().
- Read connection parameters from the site config (storage/0/postgresql_driver, postgresql_host, postgresql_port, postgresql_db, postgresql_conn_timeout) and credentials from the environment (RFM_POSTGRES_USER / RFM_POSTGRES_PASSWORD).
- Pass connect_timeout via the DBAPI connect args.
- Extend the config schema (schemas/config.json) with the new Postgres options and add sensible defaults (driver=psycopg2, conn_timeout=60).
- Keep SQLite as-is; no behavior change for existing users.
- Add psycopg2-binary==2.9.8 to requirements.txt to provide the PG driver.
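A hedged sketch of how such a connector might wire these parameters together, building on the `_ConnectionStrategy` sketch above; the option and environment-variable names come from the commit message, while the constructor signature and defaults are assumptions:

import os

import sqlalchemy as sa
from sqlalchemy.engine import URL


class _PostgresConnector(_ConnectionStrategy):
    def __init__(self, driver='psycopg2', host='localhost', port=5432,
                 db='reframe', conn_timeout=60):
        # Values would come from the storage/0/postgresql_* config options
        self.__driver = driver
        self.__host = host
        self.__port = port
        self.__db = db
        self.__conn_timeout = conn_timeout

    def _build_connection_url(self):
        # Credentials are taken from the environment, not from the config
        return URL.create(drivername=f'postgresql+{self.__driver}',
                          username=os.environ['RFM_POSTGRES_USER'],
                          password=os.environ['RFM_POSTGRES_PASSWORD'],
                          host=self.__host,
                          port=self.__port,
                          database=self.__db)

    def create_engine(self):
        # connect_timeout is a psycopg2/libpq parameter, forwarded through
        # the DBAPI connect args
        return sa.create_engine(
            self._build_connection_url(),
            connect_args={'connect_timeout': self.__conn_timeout}
        )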
- Add a ConnectionStrategy.json_column_type property (default Text, overridden to JSONB for Postgres).
- Use this type in the schema so that Postgres stores sessions.json_blob as JSONB.
- Normalize JSONB values in _decode_sessions() by serializing non-strings to JSON strings for consistent downstream parsing.
- CLI: display backend info depending on the storage backend (SQLite file path vs. PostgreSQL host:port/db).
- Write the report as encoded JSON when the backend is PostgreSQL.
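A short sketch of the column-type switch and the normalization described above, repeating the connector classes in stripped-down form; the helper name is illustrative (the PR does this inside `_decode_sessions()`):

import json

from sqlalchemy import Text
from sqlalchemy.dialects.postgresql import JSONB


class _ConnectionStrategy:
    @property
    def json_column_type(self):
        # Default: JSON blobs are stored as plain text (SQLite)
        return Text


class _PostgresConnector(_ConnectionStrategy):
    @property
    def json_column_type(self):
        # Postgres stores sessions.json_blob natively as JSONB
        return JSONB


def _normalize_json_value(value):
    # JSONB columns return decoded Python objects; re-serialize them so the
    # downstream parsing always receives a JSON string, as with Text columns
    return value if isinstance(value, str) else json.dumps(value)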
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #3567       +/-   ##
============================================
- Coverage    91.22%   69.04%   -22.18%
============================================
  Files           62       60        -2
  Lines        13525    13483       -42
============================================
- Hits        12338     9310     -3028
- Misses       1187     4173     +2986

☔ View full report in Codecov by Sentry.
I had a quick look over the key idea and I generally agree with the abstraction. I still have some questions, though, that I'd like to have clarified and tested before moving forward, because we want to ensure that the current functionality and performance are not compromised (check my comments in the code).
Also, how easy would it be with this design to add another backend, such as DuckDB?
# Update DB file mode
os.chmod(self.__db_file, self.__db_file_mode)

def _db_create_indexes(self):
Where are the indexes created in the new design? These are crucial for the performance of time-based queries.
- During the `_db_schema()` call, the full database schema (tables, columns, foreign keys and indexes) is declared:
def _db_schema(self):
self.__sessions_table = Table(
"sessions", self.__metadata,
Column("uuid", Text, primary_key=True),
Column("session_start_unix", Float),
...,
Index("index_sessions_time", "session_start_unix"),
)
self.__testcases_table = Table(
"testcases", self.__metadata,
Column("job_completion_time_unix", Float),
...,
Index("index_testcases_time", "job_completion_time_unix"),
)
The schema definitions are then translated into SQL statements and executed against the database when `_db_create()` calls:

self.__metadata.create_all(self.__connector.engine)

`create_all()` is idempotent and creates any missing tables/columns/indexes without altering existing ones.
- As a quick check, I have inspected the SQLite database generated when running `./test_reframe.py -- unittests/test_reporting.py`, and it shows that the indexes are there:
sqlite3 /tmp/pytest-of-mredenti/pytest-0/test_storage_api0/.reframe/reports/results.db
SQLite version 3.45.1 2024-01-30 16:01:20
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE sessions (
uuid TEXT NOT NULL,
session_start_unix FLOAT,
session_end_unix FLOAT,
json_blob TEXT,
report_file TEXT,
PRIMARY KEY (uuid)
);
CREATE INDEX index_sessions_time ON sessions (session_start_unix);
CREATE TABLE metadata (
schema_version TEXT
);
CREATE TABLE testcases (
name TEXT,
system TEXT,
partition TEXT,
environ TEXT,
job_completion_time_unix FLOAT,
session_uuid TEXT,
uuid TEXT,
FOREIGN KEY(session_uuid) REFERENCES sessions (uuid) ON DELETE CASCADE
);
CREATE INDEX index_testcases_time ON testcases (job_completion_time_unix);
Of course, it is perhaps worth adding a test that queries the database and verifies that the schema matches the intended one, including that the correct indexes have been created.
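For example, such a test could use SQLAlchemy's inspector; a rough sketch, where the DDL is only a stand-in for the schema that the storage backend would create:

import sqlalchemy as sa


def test_schema_indexes():
    # Stand-in: an in-memory SQLite DB with the relevant DDL; in the real
    # test this would be the engine of the freshly created results.db
    engine = sa.create_engine('sqlite://')
    with engine.begin() as conn:
        conn.execute(sa.text('CREATE TABLE sessions ('
                             'uuid TEXT PRIMARY KEY, '
                             'session_start_unix FLOAT)'))
        conn.execute(sa.text('CREATE INDEX index_sessions_time '
                             'ON sessions (session_start_unix)'))

    insp = sa.inspect(engine)
    index_names = {ix['name'] for ix in insp.get_indexes('sessions')}
    assert 'index_sessions_time' in index_names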
@contextlib.contextmanager
def _db_read(self, *args, **kwargs):
    with self.__db_lock.read_lock():
        with self._db_connect(*args, **kwargs) as conn:
            yield conn

@contextlib.contextmanager
def _db_write(self, *args, **kwargs):
    with self.__db_lock.write_lock():
        with self._db_connect(*args, **kwargs) as conn:
            yield conn
How can we guarantee a similar locking mechanism in the new design? These are needed to avoid "database locked" errors when multiple ReFrame instances update the DB using the SQLite backend. And, generally, the redesign should allow extending the underlying backend's capabilities with extra functionality, like the explicit read/write locking here.
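Not something this PR implements, but one possible direction for the SQLite case would be a per-connection busy timeout, set through the same kind of connect-event hook as the foreign-key PRAGMA, so that concurrent writers wait instead of failing immediately; a minimal sketch:

import sqlalchemy as sa
from sqlalchemy import event

engine = sa.create_engine('sqlite:////path/to/results.db')

@event.listens_for(engine, 'connect')
def _set_busy_timeout(dbapi_conn, conn_record):
    cursor = dbapi_conn.cursor()
    # Wait up to 30s for a lock instead of raising "database is locked"
    cursor.execute('PRAGMA busy_timeout=30000')
    cursor.close()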
Also, regarding this:
I don't believe this is necessary for two reasons:
With this refactor, introducing a new SQL-based backend (e.g. DuckDB) requires implementing a new connector subclass:

class _DuckConnector(_ConnectionStrategy):
    def __init__(self):
        ...

    def _build_connection_url(self):
        return URL.create(
            drivername="duckdb",
            database="/path/to/results.duckdb"
        ).render_as_string()

The idea is indeed to be able to add a new backend in an isolated manner, independently of the storage and SQL logic. Now, once a new dialect has been introduced, the only additional change required in StorageBackend.create() is:

elif backend == 'duckdb':
    return _SqlStorage(_DuckConnector(), *args, **kwargs)

Commit e162185 shows this for PostgreSQL.
Always serialize the report object to JSON text using `jsonext.dumps()` before inserting into the database. This prevents `TypeError: Object of type RunReport is not JSON serializable` errors when using the PostgreSQL backend (JSONB column). For SQLite, we continue storing the JSON string.
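A minimal sketch of that insert path, assuming `report`, `session_uuid`, `sessions_table` and `engine` are already in scope (`jsonext` refers to ReFrame's JSON utility module):

import reframe.utility.jsonext as jsonext

# Serialize once, up front: both the SQLite Text column and the Postgres
# JSONB column then receive a JSON string
blob = jsonext.dumps(report)
with engine.begin() as conn:
    conn.execute(sessions_table.insert().values(uuid=session_uuid,
                                                json_blob=blob))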
This PR introduces SQLAlchemy Core to abstract database interactions in ReFrame's storage backend, adding support for PostgreSQL alongside SQLite. It serves as a proof-of-concept to gather feedback on this approach before adding comprehensive tests and documentation. I have manually tested PostgreSQL with `psycopg2` for schema creation, storage, and CLI output. Full PostgreSQL tests in `test_storage.py` will follow approval.

Motivation

Supporting multiple SQL dialects (SQLite, PostgreSQL, MySQL) is complex due to differences in:
- DBAPI drivers (e.g., `psycopg2` or `pg8000`)

SQLAlchemy Core unifies these interactions, improving maintainability and extensibility.
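As a small illustration of what this unification buys (an assumed toy example, not code from this PR), the same Core expression compiles to the SQL of either dialect, so the storage logic never has to deal with dialect-specific strings:

import sqlalchemy as sa
from sqlalchemy.dialects import postgresql, sqlite

metadata = sa.MetaData()
testcases = sa.Table('testcases', metadata,
                     sa.Column('name', sa.Text),
                     sa.Column('job_completion_time_unix', sa.Float))

query = sa.select(testcases).where(
    testcases.c.job_completion_time_unix >= 1700000000.0
)

# One expression, two dialects
print(query.compile(dialect=sqlite.dialect()))
print(query.compile(dialect=postgresql.dialect()))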
Next Steps (pending feedback)
Note: Users can replace `psycopg2` with their preferred PostgreSQL driver (e.g., `pg8000`) by updating `requirements.txt` and `postgresql_driver` in the system configuration. Suggestions on test cases or PostgreSQL edge cases are welcome.